Abstract



One of the most popular algorithms for clustering in Euclidean space is the k-means algorithm; k-means is difficult to analyze mathematically, and few theoretical guarantees are known about it, particularly when the data is well-clustered. In this paper, we attempt to fill this gap in the literature by analyzing the behavior of k-means on well-clustered data. In particular, we study the case when each cluster is distributed as a different Gaussian, or, in other words, when the input comes from a mixture of Gaussians.

We analyze three aspects of the k-means algorithm under this assumption. First, we show that when the input comes from a mixture of two spherical Gaussians, a variant of the 2-means algorithm successfully isolates the subspace containing the means of the mixture components. Second, we show an exact expression for the convergence of our variant of the 2-means algorithm, when the input is a very large number of samples from a mixture of spherical Gaussians. Our analysis does not require any lower bound on the separation between the mixture components.

Finally, we study the sample requirement of k-means; for a mixture of 2 spherical Gaussians, we show an upper bound on the number of samples required by a variant of 2-means to get close to the true solution. The sample requirement grows with increasing dimensionality of the data, and decreasing separation between the means of the Gaussians. To match our upper bound, we show an information-theoretic lower bound on any algorithm that learns mixtures of two spherical Gaussians; our lower bound indicates that in the case when the overlap between the probability masses of the two distributions is small, the sample requirement of k-means is near-optimal.






1 Introduction 



One of the most popular algorithms for clustering in Euclidean space is the k-means algorithm [Llo82, For65, Mac67]; this is a simple, local-search algorithm that iteratively refines a partition of the input points until convergence. Like many local-search algorithms, k-means is notoriously difficult to analyze, and few theoretical guarantees are known about it.

There have been three lines of work on the k-means algorithm. A first line of questioning addresses the quality of the solution produced by k-means, in comparison to the globally optimal solution. While it is well known that for general inputs the quality of this solution can be arbitrarily bad, the conditions under which k-means yields a globally optimal solution on well-clustered data are not well understood. A second line of work [AV06, Vat09] examines the number of iterations required by k-means to converge. [Vat09] shows that there exists a set of n points on the plane, such that k-means takes as many as $\Omega(2^n)$ iterations to converge on these points. A smoothed analysis upper bound of poly(n) iterations has been established by [AMR09], but this bound is still much higher than what is observed in practice, where the number of iterations is frequently sublinear in n. Moreover, the smoothed analysis bound applies to small perturbations of arbitrary inputs, and the question of whether one can get faster convergence on well-clustered inputs is still unresolved. A third question, considered in the statistics literature, is the statistical efficiency of k-means. Suppose the input is drawn from some simple distribution, for which k-means is statistically consistent; then, how many samples are required for k-means to converge? Are there other consistent procedures with a better sample requirement?

In this paper, we study all three aspects of k-means, by studying the behavior of k-means on Gaussian clusters. Such data is frequently modelled as a mixture of Gaussians; a mixture is a collection of Gaussians $\mathcal{D} = \{D_1, \ldots, D_k\}$ and weights $w_1, \ldots, w_k$, such that $\sum_i w_i = 1$. To sample from the mixture, we first pick i with probability $w_i$ and then draw a random sample from $D_i$. Clustering such data then reduces to the problem of learning a mixture; here, we are given only the ability to sample from a mixture, and our goal is to learn the parameters of each Gaussian $D_i$, as well as determine which Gaussian each sample came from.
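
The two-stage sampling procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_mixture`, the spherical covariance, and all numeric values are hypothetical choices for the example and do not come from the paper.

```python
import random

def sample_mixture(weights, means, sigmas, rng):
    """Draw one point from a mixture of spherical Gaussians.

    First pick a component i with probability weights[i], then draw a
    sample from N(means[i], sigmas[i]^2 * I). Returns (point, i).
    """
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    point = [rng.gauss(m, sigmas[i]) for m in means[i]]
    return point, i

# Illustrative parameters (assumptions for the demo only).
rng = random.Random(0)
weights = [0.5, 0.5]
means = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0]]
sigmas = [1.0, 1.0]
samples = [sample_mixture(weights, means, sigmas, rng) for _ in range(1000)]
```

A learning algorithm only sees the points; the component index returned here is the hidden label that the clustering is meant to recover.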

Our results are as follows. First, we show that when the input comes from a mixture of two 
spherical Gaussians, a variant of the 2-means algorithm successfully isolates the subspace containing 
the means of the Gaussians. Second, we show an exact expression for the convergence of a variant 
of the 2-means algorithm, when the input is a large number of samples from a mixture of two 
spherical Gaussians. Our analysis shows that the convergence-rate is logarithmic in the dimension, 
and decreases with increasing separation between the mixture components. Finally, we address the 
sample requirement of k-means; for a mixture of 2 spherical Gaussians, we show an upper bound on
the number of samples required by a variant of 2-means to get close to the true solution. The sample 
requirement grows with increasing dimensionality of the data, and decreasing separation between 
the means of the distributions. To match our upper bound, we show an information-theoretic lower 
bound on any algorithm that learns mixtures of two spherical Gaussians; our lower bound indicates 
that in the case when the overlap between the probability masses of the two distributions is small, 
the sample requirement of 2-means is near-optimal. 

Additionally, we make some partial progress towards analyzing k-means in the more general case: we show that if our variant of 2-means is run on a mixture of k spherical Gaussians, then it converges to a vector in the subspace containing the means of the mixture components.

The key insight in our analysis is a novel potential function $\theta_t$, which is the minimum angle between the subspace of the means of $D_i$, and the normal to the hyperplane separator in 2-means. We show that this angle decreases with iterations of our variant of 2-means, and we can characterize convergence rates and sample requirements, by characterizing the rate of decrease of the potential.

Our Results. More specifically, our results are as follows. We perform a probabilistic analysis of a variant of 2-means; our variant is essentially a symmetrized version of 2-means, and it reduces to 2-means when we have a very large number of samples from a mixture of two identical spherical Gaussians with equal weights. In the 2-means algorithm, the separator between the two clusters is always a hyperplane, and we use the angle $\theta_t$ between the normal to this hyperplane and the mean of a mixture component in round t, as a measure of the potential in each round. Note that when $\theta_t = 0$, we have arrived at the correct solution.

First, in Section 3, we consider the case when we have at our disposal a very large number of samples from a mixture of $N(\mu^1, (\sigma^1)^2 I_d)$ and $N(\mu^2, (\sigma^2)^2 I_d)$ with mixing weights $\rho^1, \rho^2$ respectively. We show an exact relationship between $\theta_t$ and $\theta_{t+1}$, for any value of $\mu^j$, $\sigma^j$, $\rho^j$ and t. Using this relationship, we can approximate the rate of convergence of 2-means, for different values of the separation, as well as different initialization procedures. Our guarantees illustrate that the progress of k-means is very fast when one is far from the actual solution (namely, the square of the cosine of $\theta_t$ grows by at least a constant factor each round, for high separation), and slow when the actual solution is very close.

Next, in Section 4, we characterize the sample requirement for our variant of 2-means to succeed,
when the input is a mixture of two spherical Gaussians. For the case of two identical spherical
Gaussians with equal mixing weight, our results imply that when the separation $\mu < 1$, and when
samples are used in each round, the 2-means algorithm makes progress at roughly the same 
rate as in Section 3. This agrees with the $\Omega(\frac{1}{\mu^4})$ sample complexity lower bound [Lin96] for learning
a mixture of Gaussians on the line, as well as with experimental results of [SSR06]. When $\mu > 1$,
our variant of 2-means makes progress in each round, when the number of samples is at least 

Then, in Section 5, we provide an information-theoretic lower bound on the sample requirement of any algorithm for learning a mixture of two spherical Gaussians with standard deviation 1 and equal weight. We show that when the separation $\mu > 1$, any algorithm requires $\Omega(\frac{d}{\mu^2})$ samples to converge to a vector within angle $\theta = \cos^{-1}(c)$ of the true solution, where c is a constant. This indicates that k-means has near-optimal sample requirement when $\mu > 1$.

Finally, in Section 6, we examine the performance of 2-means when the input comes from a mixture of k spherical Gaussians. We show that, in this case, the normal to the hyperplane separating the two clusters converges to a vector in the subspace containing the means of the mixture components. Again, we characterize exactly the rate of convergence, which looks very similar to the bounds in Section 3.

Related Work. The convergence-time of the k-means algorithm has been analyzed in the worst-case [AV06, Vat09] and the smoothed analysis settings [MR09, AMR09]; [Vat09] shows that the convergence-time of k-means may be $\Omega(2^n)$ even in the plane. [AMR09] establishes an $O(n^{30})$ smoothed complexity bound. [ORSS06] analyzes the performance of k-means when the data obeys a clusterability condition; however, their clusterability condition is very different, and moreover, they examine conditions under which constant-factor approximations can be found. In the statistics literature, the k-means algorithm has been shown to be consistent [Mac67]. [Pol81] shows that minimizing the k-means objective function (namely, the sum of the squares of the distances between each point and the center it is assigned to) is consistent, given sufficiently many samples. As optimizing the k-means objective is NP-Hard, one cannot hope to always get an exact solution. Neither of these two works quantifies either the convergence rate or the exact sample requirement of k-means.

There have been two lines of previous work on theoretical analysis of the EM algorithm [DLR77], which is closely related to k-means. Essentially, for learning mixtures of identical Gaussians, the only difference between EM and k-means is that EM uses partial assignments or soft clusterings, whereas k-means does not. First, [RW84, XJ96] view learning mixtures as an optimization problem, and EM as an optimization procedure over the likelihood surface. They analyze the structure of the likelihood surface around the optimum to conclude that EM has first-order convergence. An optimization procedure on a parameter m is said to have first-order convergence, if,

$$\|m_{t+1} - m^*\| \le R \cdot \|m_t - m^*\|$$

where $m_t$ is the estimate of m at time step t using n samples, $m^*$ is the maximum likelihood estimator for m using n samples, and R is some fixed constant between 0 and 1. In contrast, our analysis also applies when one is far from the optimum.

The second line of work is a probabilistic analysis of EM due to [DS00]: they show a two-round
variant of EM which converges to the correct partitioning of the samples, when the input is
generated by a mixture of k well-separated, spherical Gaussians. For their analysis to work, they 
require the mixture components to be separated such that two samples from the same Gaussian 
are a little closer in space than two samples from different Gaussians. In contrast, our analysis 
applies when the separation is much smaller. 

The sample requirement of learning mixtures has been previously studied in the literature, but not in the context of k-means. [CHRZ07, Cha07] provide an algorithm that learns a mixture of two binary product distributions with uniform weights, when the separation $\mu$ between the mixture components is at least a constant, so long as $\Omega(\frac{d}{\mu^4})$ samples are available. (Notice that for such distributions, the directional standard deviation is at most 1.) Their algorithm is similar to k-means in some respects, but different in that they use different sets of coordinates in each round, and this is crucial in their analysis. Additionally, [BCOFZ07] show a spectral algorithm which learns a mixture of k binary product distributions, when the distributions have small overlap in probability mass, and the sample size is at least $\Omega(d/\mu^2)$. [Lin96] shows that at least $\Omega(\frac{1}{\mu^4})$ samples are required to learn a mixture of two Gaussians in one dimension.

We note that although our lower bound of $\Omega(d/\mu^2)$ for $\mu > 1$ seems to contradict the upper bound of [CHRZ07, Cha07], this is not actually the case. Our lower bound characterizes the number of samples required to find a vector at an angle $\theta = \cos^{-1}(1/10)$ with the vector joining the means. However, in order to classify a constant fraction of the points correctly, we only need to find a vector at an angle $\theta' = \cos^{-1}(1/\mu)$ with the vector joining the means. Since the goal of [CHRZ07] is to simply classify a constant fraction of the samples, their upper bound is less than $O(d/\mu^2)$.

In addition to theoretical analysis, there has been very interesting experimental work due
to [SSR06], which studies the sample requirement for EM on a mixture of k spherical Gaussians.
They conjecture that the problem of learning mixtures has three phases, depending on the number 
of samples : with less than about samples, learning mixtures is information-theoretically hard; 

with more than about -4r samples, it is computationally easy, and in between, computationally 
hard, but easy in an information-theoretic sense. Finally, there has been a line of work which provides algorithms (different from EM or k-means) that are guaranteed to learn mixtures of Gaussians under certain separation conditions; see, for example, [Das99, VW02, AK05, AM05, KSV05, CR08, BV08]. For mixtures of two Gaussians, our result is comparable to the best results for spherical Gaussians [VW02] in terms of separation requirement, and we have a smaller sample requirement.

2 The Setting 

The k-means algorithm iteratively refines a partitioning of the input data. At each iteration, k points are maintained as centers; each input is assigned to its closest center. The center of each cluster is then recomputed as the empirical mean of the points assigned to the cluster. This procedure is continued until convergence.

Our variant of k-means is described below. There are two main differences between the actual
2-means algorithm, and our variant. First, we use a separate set of samples in each iteration. 
Secondly, we always fix the cluster boundary to be a hyperplane through the origin. When the 
input is a very large number of samples from a mixture of two identical Gaussians with equal mixing 
weights, and with center of mass at the origin, this is exactly 2-means initialized with symmetric 
centers (with respect to the origin). We analyze this symmetrized version of 2-means even when 
the mixing weights and the variances of the Gaussians in the mixture are not equal. 

The input to our algorithm is a set of samples S, a number of iterations N, and a starting vector $u_0$, and the output is a vector $u_N$ obtained after N iterations of the 2-means algorithm.

2-means-iterate(S, N, $u_0$)

1. Partition S randomly into sets of equal size $S_1, \ldots, S_N$.

2. For iteration t = 0, ..., N - 1, compute:

$$C_{t+1} = \{x \in S_{t+1} \mid \langle x, u_t \rangle > 0\}$$
$$\bar{C}_{t+1} = \{x \in S_{t+1} \mid \langle x, u_t \rangle < 0\}$$

Compute: $u_{t+1}$ as the empirical average of $C_{t+1}$.
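
As a rough illustration of the procedure above, the following Python sketch implements the iteration with a fresh batch per round and a separating hyperplane through the origin. It is a sketch under assumptions rather than the authors' code: the helper names (`dot`, `two_means_iterate`), the handling of an empty cluster, and every numeric value in the demo are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def two_means_iterate(samples, n_iters, u0, rng):
    """Sketch of 2-means-iterate: one fresh batch per round, and a
    separating hyperplane through the origin with normal u_t."""
    pts = list(samples)
    rng.shuffle(pts)                      # step 1: random partition of S
    batch = len(pts) // n_iters
    u = list(u0)
    for t in range(n_iters):              # step 2
        s_next = pts[t * batch:(t + 1) * batch]
        # Cluster of points on the positive side of the hyperplane
        cluster = [x for x in s_next if dot(x, u) > 0]
        if not cluster:                   # degenerate batch: keep u_t (assumption)
            continue
        # u_{t+1} is the empirical average of that cluster
        u = [sum(col) / len(cluster) for col in zip(*cluster)]
    return u

# Illustrative data: two identical spherical Gaussians at +mean1 and -mean1.
rng = random.Random(1)
d, n, mu = 10, 10000, 1.5
mean1 = [mu] + [0.0] * (d - 1)
samples = []
for _ in range(n):
    sign = 1.0 if rng.random() < 0.5 else -1.0
    samples.append([sign * m + rng.gauss(0.0, 1.0) for m in mean1])

u0 = [1.0] * d                            # cos(theta_0) = 1/sqrt(d)
u_final = two_means_iterate(samples, 5, u0, rng)
cos_final = dot(u_final, mean1) / (math.sqrt(dot(u_final, u_final)) * mu)
print(round(cos_final, 3))
```

The printed value is the cosine of the angle between the returned vector and the first mean, which is the progress measure used throughout the analysis.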

Notation. In Sections 3 and 4, we analyze Algorithm 2-means-iterate, when the input is generated by a mixture $\mathcal{D} = \{D_1, D_2\}$ of two Gaussians. We let $D_1 = N(\mu^1, (\sigma^1)^2 I_d)$, $D_2 = N(\mu^2, (\sigma^2)^2 I_d)$, with mixing weights $\rho^1$ and $\rho^2$. We also assume without loss of generality that for all j, $\sigma^j \ge 1$. As the center of mass of the mixture lies at the origin, $\rho^1\mu^1 + \rho^2\mu^2 = 0$. In Section 6, we study a somewhat more general case.

We define b as the unit vector along $\mu^1$, i.e. $b = \frac{\mu^1}{\|\mu^1\|}$. Henceforth, for any vector v, we use the notation $\hat{v}$ to denote the unit vector along v, i.e. $\hat{v} = \frac{v}{\|v\|}$. Therefore, $\hat{u}_t$ is the unit vector along $u_t$. We assume without loss of generality that $\mu^1$ lies in the cluster $C_{t+1}$. In addition, for each t, we define $\theta_t$ as the angle between $\mu^1$ and $u_t$. We use the cosine of $\theta_t$ as a measure of progress of the algorithm at round t, and our goal is to show that this quantity increases as t increases. Observe that $0 \le \cos(\theta_t) \le 1$, and $\cos(\theta_t) = 1$ when $u_t$ and $\mu^1$ are aligned along the same direction. For each t, we define $\tau_t^j = \langle \mu^j, \hat{u}_t \rangle = \langle \mu^j, b \rangle \cos(\theta_t)$. Moreover, from our notation, $\cos(\theta_t) = \frac{\tau_t^1}{\|\mu^1\|}$. In addition, we define $\rho_{\min} = \min_j \rho^j$, $\mu_{\min} = \min_j \|\mu^j\|$, and $\sigma_{\max} = \max_j \sigma^j$. For the special case of two identical spherical Gaussians with equal weights, we use $\mu = \|\mu^1\| = \|\mu^2\|$. Finally, for $a < b$, we use the notation $\Phi(a, b)$ to denote the probability that a standard normal variable takes values between a and b.
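
The quantities in this notation are straightforward to compute numerically. The sketch below is illustrative only; the function names `cos_theta`, `tau` and `phi_interval` are hypothetical, and the interval probability is evaluated through the error function from the standard library.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cos_theta(mu1, u):
    """Cosine of the angle between the mean mu^1 and the vector u_t."""
    return sum(a * b for a, b in zip(mu1, u)) / (norm(mu1) * norm(u))

def tau(mu_j, u):
    """Projection of mu^j onto the unit vector along u_t."""
    return sum(a * b for a, b in zip(mu_j, u)) / norm(u)

def phi_interval(a, b):
    """Probability that a standard normal variable lies between a and b."""
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return cdf(b) - cdf(a)

# Small illustrative vectors (assumed values).
mu1 = [3.0, 0.0]
u_t = [1.0, 1.0]
print(cos_theta(mu1, u_t), tau(mu1, u_t), phi_interval(-1.0, 1.0))
```

For the first component, the projection equals the norm of the mean times the cosine, which is the identity used later in the caption of Figure 1.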

3 Exact Estimation 

In this section, we examine the performance of Algorithm 2-means-iterate when one can estimate the vectors $u_t$ exactly, that is, when a very large number of samples from the mixture is available. Our main result of this section is Lemma 1, which exactly characterizes the behavior of 2-means-iterate at a specific iteration t.

For any t, we define the quantities $\xi_t$ and $m_t$ as follows:




$$\{x \in S_{t+1} \mid \langle x, u_t \rangle > 0\}$$
$$\{x \in S_{t+1} \mid \langle x, u_t \rangle < 0\}$$

Figure 1: Here we are depicting the plane defined by the vectors $\mu^1$ and $u_t$. The vector $v_t$ is simply the unit vector along $\mu^1 - \langle \mu^1, \hat{u}_t \rangle \hat{u}_t$. Therefore, we have $\tau_t^1 = \|\mu^1\| \cos(\theta_t)$ and $\sqrt{\|\mu^1\|^2 - (\tau_t^1)^2} = \|\mu^1\| \sin(\theta_t)$.



Now, our main lemma can be stated as follows. 
Lemma 1.

$$\cos^2(\theta_{t+1}) = \cos^2(\theta_t)\left(1 + \tan^2(\theta_t)\,\frac{2\cos(\theta_t)\,\xi_t m_t + m_t^2}{\xi_t^2 + 2\cos(\theta_t)\,\xi_t m_t + m_t^2}\right)$$



The proof is in the Appendix. Using Lemma 1, we can characterize the convergence rates and times of 2-means-iterate for different values of $\mu^j$, $\rho^j$ and $\sigma^j$, as well as different initializations of $u_0$.

The convergence rates can be characterized in terms of two natural parameters of the problem, 
M = which measures how much the distributions are separated, and V = p 3 ® 3 , 

which measures the average standard deviations of the distributions. We observe that as $\sigma^j \ge 1$ for all j, $V \ge 1$ always. To characterize these rates, it is also convenient to look at two different cases, according to the value of $\mu^j$, the separation between the mixture components.



Small $\mu^j$. First, we consider the case when each $\|\mu^j\|/\sigma^j$ is less than a fixed constant $\sqrt{\ln\frac{9}{2\pi}}$, including the case when $\|\mu^j\|$ can be much less than 1. In this case, the Gaussians are not even separated in terms of probability mass; in fact, as $\|\mu^j\|/\sigma^j$ decreases, the overlap in probability mass between the Gaussians tends to 1. However, we show that 2-means-iterate can still do something interesting, in terms of recovering the subspace containing the means of the distributions. Theorem 2 summarizes the convergence rate in this case.



Theorem 2 (Small $\mu^j$). Let $\|\mu^j\|/\sigma^j < \sqrt{\ln\frac{9}{2\pi}}$, for j = 1, 2. Then, there exist fixed constants $a_1$ and $a_2$, such that:

$$\cos^2(\theta_t)(1 + a_1 (M/V) \sin^2(\theta_t)) \le \cos^2(\theta_{t+1}) \le \cos^2(\theta_t)(1 + a_2 (M/V) \sin^2(\theta_t))$$

For a mixture of two identical Gaussians with equal mixing weights, we can conclude: 

Corollary 3. For a mixture of two identical spherical Gaussians with equal mixing weights and standard deviation 1, if $\mu = \|\mu^1\| = \|\mu^2\| < \sqrt{\ln\frac{9}{2\pi}}$, then,

$$\cos^2(\theta_t)(1 + a_1' \mu^2 \sin^2(\theta_t)) \le \cos^2(\theta_{t+1}) \le \cos^2(\theta_t)(1 + a_2' \mu^2 \sin^2(\theta_t))$$

The proof follows by a combination of Lemma 1 and Lemma 25. From Corollary 3, we observe that $\cos^2(\theta_t)$ grows by a factor of $(1 + \Theta(\mu^2))$ in each iteration, except when $\theta_t$ is very close to 0. This means that when 2-means-iterate is far from the actual solution, it approaches the solution at a consistently high rate. The convergence rate only slows once k-means is very close to the actual solution.
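
To make the two phases concrete, one can iterate a recurrence of the same shape as the lower bound in Corollary 3, writing the squared sine as one minus the squared cosine. The sketch below is purely illustrative: the constant `a` stands in for the unspecified fixed constant, and the values of `mu`, `eps` and the starting point are assumptions. The product `a * mu**2` is kept at most 1 so that the iterate never exceeds 1.

```python
def rounds_to_converge(cos2_0, mu, a, eps, max_rounds=10000):
    """Iterate c <- c * (1 + a * mu^2 * (1 - c)) until c >= 1 - eps.

    Here c plays the role of cos^2(theta_t) and (1 - c) of sin^2(theta_t).
    Returns the number of rounds used and the full trajectory.
    """
    c, history = cos2_0, [cos2_0]
    while c < 1.0 - eps and len(history) <= max_rounds:
        c = c * (1.0 + a * mu * mu * (1.0 - c))
        history.append(c)
    return len(history) - 1, history

# A random start in d = 100 dimensions gives cos^2(theta_0) of order 1/d.
n_rounds, history = rounds_to_converge(cos2_0=0.01, mu=0.5, a=1.0, eps=0.01)
first_ratio = history[1] / history[0]     # growth factor far from the solution
last_ratio = history[-1] / history[-2]    # growth factor close to the solution
print(n_rounds, first_ratio, last_ratio)
```

Far from the solution the per-round growth factor is close to 1 + a * mu^2, while near the solution it approaches 1, matching the qualitative behavior described above.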



Large $\mu^j$. In this case, there exists a j such that $\|\mu^j\|/\sigma^j > \sqrt{\ln\frac{9}{2\pi}}$. In this regime, the Gaussians have small overlap in probability mass, yet the distance between two samples from the same distribution is much greater than the separation between the distributions. Our guarantees for this case are summarized by Theorem 4.

We see from Theorem 4 that there are two regimes of behavior of the convergence rate, depending on the value of $\max_j |\tau_t^j|/\sigma^j$. These regimes have a natural interpretation. The first regime corresponds to the case when $\theta_t$ is large enough, such that when projected onto $u_t$, at most a constant fraction of samples from the two distributions can be classified with high confidence. The second regime corresponds to the case when $\theta_t$ is close enough to 0 such that when projected along $u_t$, most of the samples from the distributions can be classified with high confidence. As expected, in the second regime, the convergence rate is much slower than in the first regime.



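Theorem 4 (Large $\mu^j$). Suppose there exists j such that $\|\mu^j\|/\sigma^j > \sqrt{\ln\frac{9}{2\pi}}$. If $|\tau_t^j|/\sigma^j < \sqrt{\ln\frac{9}{2\pi}}$ for all j, then there exist fixed constants $a_3$, $a_4$, $a_5$ and $a_6$ such that:

$$\cos^2(\theta_t)\left(1 + \frac{a_3 (M/V)^2 \sin^2(\theta_t)}{a_4 + (M/V)^2 \cos^2(\theta_t)}\right) \le \cos^2(\theta_{t+1}) \le \cos^2(\theta_t)\left(1 + \frac{a_5 ((M/V) + (M/V)^2) \sin^2(\theta_t)}{a_6 + (M/V)^2 \cos^2(\theta_t)}\right)$$

On the other hand, if there exists j such that $|\tau_t^j|/\sigma^j > \sqrt{\ln\frac{9}{2\pi}}$, then there exist fixed constants $a_7$ and $a_8$ such that:

$$\cos^2(\theta_t)\left(1 + \frac{a_7\, \rho_{\min}^2 \mu_{\min}^2}{a_8 V^2 + \rho_{\min}^2 \mu_{\min}^2} \tan^2(\theta_t)\right) \le \cos^2(\theta_{t+1}) \le \cos^2(\theta_t)(1 + \tan^2(\theta_t))$$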



For two identical Gaussians with standard deviation 1, we can conclude: 
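Corollary 5. For a mixture of two identical Gaussians with equal mixing weights, and standard deviation 1, if $\mu = \|\mu^1\| = \|\mu^2\| > \sqrt{\ln\frac{9}{2\pi}}$, and if $|\tau_t^1| = |\tau_t^2| < \sqrt{\ln\frac{9}{2\pi}}$, then there exist fixed constants $a_3'$, $a_4'$, $a_5'$, $a_6'$ such that:

$$\cos^2(\theta_t)\left(1 + \frac{a_3' \mu^4 \sin^2(\theta_t)}{a_4' + \mu^4 \cos^2(\theta_t)}\right) \le \cos^2(\theta_{t+1}) \le \cos^2(\theta_t)\left(1 + \frac{a_5' \mu^4 \sin^2(\theta_t)}{a_6' + \mu^4 \cos^2(\theta_t)}\right)$$

On the other hand, if $|\tau_t^1| = |\tau_t^2| > \sqrt{\ln\frac{9}{2\pi}}$, then there exists a fixed constant $a_7'$ such that:

$$\cos^2(\theta_t)(1 + a_7' \tan^2(\theta_t)) \le \cos^2(\theta_{t+1}) \le \cos^2(\theta_t)(1 + \tan^2(\theta_t))$$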

In this case as well, we observe the same phenomenon: the convergence rate is high when we are far away from the solution, and slow when we are close. Using Theorems 2 and 4, we can characterize the convergence times of 2-means-iterate; for the sake of simplicity, we present the convergence time bounds for a mixture of two spherical Gaussians with equal mixing weights and standard deviation 1. We recall that in this case 2-means-iterate is exactly 2-means.

Corollary 6 (Convergence Time). If $\theta_0$ is the initial angle between $\mu^1$ and $u_0$, then, $\cos^2(\theta_N) \ge$

1 — e after N = Cq ■ f in(7+tJ? J~ + in(i+e) ^ iterations, where Co is a fixed constant. 

Effect of Initialization. As apparent from Corollary 6, the effect of initialization is only to ensure a lower bound on the value of $\cos(\theta_0)$. We illustrate below two natural ways by which one can select $u_0$, and their effect on the convergence rate. For the sake of simplicity, we state these bounds for the case in which we have two identical Gaussians with equal mixing weights and standard deviation 1.

• First, one can choose $u_0$ uniformly at random from the surface of a unit sphere in $\mathbb{R}^d$; in
this case, $\cos^2(\theta_0) = \Theta(\frac{1}{d})$ with constant probability, and as a result, the convergence time
to reach cos"^/^) is 0( te( fi.^) )- 

• A second way to choose $u_0$ is to set it to be a random sample from the mixture; in this
case, cos^.j = 9(<i±^) with constant probability, and the time to reach C0S->(1/^ is 

^(ln(l+/^))- 
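
For the first initialization, the scale of the starting potential can be checked numerically. By symmetry, the expected squared cosine between a uniformly random unit vector and any fixed direction is exactly 1/d. The sketch below estimates that average; the function names and the choice of d and the number of trials are assumptions made for illustration.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform direction on the sphere: normalize a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    length = math.sqrt(sum(x * x for x in g))
    return [x / length for x in g]

def mean_cos2_random_init(d, trials, rng):
    """Average squared cosine between a random unit vector and the first
    coordinate axis (any fixed direction gives the same distribution)."""
    return sum(random_unit_vector(d, rng)[0] ** 2 for _ in range(trials)) / trials

rng = random.Random(0)
d = 50
avg = mean_cos2_random_init(d, 2000, rng)
print(avg, 1.0 / d)
```

The estimate should be close to 1/d, which is why a random start costs a number of rounds that grows with the logarithm of the dimension.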



4 Finite Samples 

In this section, we analyze Algorithm 2-means-iterate, when we are required to estimate the statistics at each round with a finite number of samples. We characterize the number of samples needed to ensure that 2-means-iterate makes progress in each round, and we also characterize the rate of progress when the required number of samples is available.

The main result of this section is the following lemma, which characterizes $\theta_{t+1}$, the angle between $\mu^1$ and the hyperplane separator in 2-means-iterate, given $\theta_t$. Notice that now $\theta_t$ is a random variable, which depends on the samples drawn in rounds 1, ..., t - 1, and given $\theta_t$, $\theta_{t+1}$ is a random variable, whose value depends on samples in round t. Also we use $u_{t+1}$ as the center of partition $C_t$ in iteration t + 1, and $E[u_{t+1}]$ is the expected center. Note that all the expectations in round t are conditioned on $\theta_t$. In addition, we use $S_{t+1}$ to denote the quantity $E[X \cdot 1_{X \in C_{t+1}}]$, where $1_{X \in C_{t+1}}$ is the indicator function for the event $X \in C_{t+1}$, and the expectation is taken over the entire mixture. Note that $S_{t+1} = E[u_{t+1}] \Pr[X \in C_{t+1}] = Z_{t+1} E[u_{t+1}]$. We use $\hat{S}_{t+1}$ to denote the empirical value of $S_{t+1}$.

Lemma 7. If we use n samples in iteration t, then, given $\theta_t$, with probability $1 - 2\delta$,

2 la A~>n™ 2 t()\ fl L+ Q r,2/7n 2cos(6>i)gim t +mf ^ ( A 2 cos 2 (6> t )+2Ai(m t +c;t cos(flt)) 



cos (9t+i) > cos (6 t ) ( 1 + tan (6 t ) y , , > . , ■ ■» ,, , 

' v z > cos(0 t )£iT7t t +m|+A 2 / \ ™i+£t + 2 ^tmt cos(0t)+A 2 



where, 
Ai 



81og(4n/<5)(cj max + maxj \ \^\ 



>n 

A miog^Sn/^K^ + E.II^II 2 ) S\og{n/5) ( , 

A 2 = 1 7= ffmax it+i + max (6 t +i,At J ) ) 

n yjn j 



The main idea behind the proof of Lemma [7] is that we can write cos 2 ((?t_|_i^ 



Next, we can use Lemma IT] and the definition of S^i to get an expression for ijg~^jjjj|^r[p > a nd 

Lemmas 8 and 9 to bound $\langle \hat{S}_{t+1} - S_{t+1}, \mu^1 \rangle$ and $\|\hat{S}_{t+1}\|^2 - \|S_{t+1}\|^2$. Plugging in all these values gives us a proof of Lemma 7. We also assume for the rest of the section that the number of samples n is at most some polynomial in d, such that $\log(n) = \Theta(\log(d))$.

The two main lemmas used in the proof of Lemma 7 are Lemmas 8 and 9. To state them, we
need to define some notation. At time t, we use the notation 

Lemma 8. For any t, and for any vector v with norm $\|v\|$, with probability at least $1 - \delta$,

$$|\langle \hat{S}_{t+1} - S_{t+1}, v \rangle| \le \frac{8 \log(4n/\delta)\, (\sigma_{\max} \|v\| + \max_j |\langle \mu^j, v \rangle|)}{\sqrt{n}}$$

Lemma 9. For any t, with probability at least $1 - \delta$,

$$\|\hat{S}_{t+1}\|^2 \le \|S_{t+1}\|^2 + \frac{128 \log^2(8n/\delta)\, (\sigma_{\max}^2 d + \sum_j \|\mu^j\|^2)}{n} + \frac{16 \log(8n/\delta)}{\sqrt{n}}\, (\sigma_{\max} \|S_{t+1}\| + \max_j |\langle S_{t+1}, \mu^j \rangle|)$$



The proofs of Lemmas 8 and 9 are in the Appendix. Applying Lemma 7, we can characterize the number of samples required such that 2-means-iterate makes progress in each round for different values of $\|\mu^j\|$. Again, it is convenient to look at two separate cases, based on the value of $\|\mu^j\|$.



Theorem 10 (Small $\mu^j$). Let $\|\mu^j\|/\sigma^j < \sqrt{\ln\frac{9}{2\pi}}$, for all j. If the number of samples drawn in round t is at least

$$a_9\, \sigma_{\max}^2 \log^2(d/\delta) \left(\frac{d}{M V \sin^4(\theta_t)} + \frac{1}{M^2 \sin^4(\theta_t) \cos^2(\theta_t)}\right),$$

for some fixed constant $a_9$, then, with probability at least $1 - \delta$, $\cos^2(\theta_{t+1}) \ge \cos^2(\theta_t)(1 + a_{10} (M/V) \sin^2(\theta_t))$, where $a_{10}$ is some fixed constant.

In particular, for the case of two identical Gaussians with equal mixing weights and standard
deviation 1, our result implies the following.



Corollary 11. Let $\mu = \|\mu^1\| = \|\mu^2\| < \sqrt{\ln\frac{9}{2\pi}}$. If the number of samples drawn in round t is at least $a_9 \log^2(d/\delta) \left(\frac{d}{\mu^2 \sin^4(\theta_t)} + \frac{1}{\mu^4 \cos^2(\theta_t) \sin^4(\theta_t)}\right)$, for some fixed constant $a_9$, then, with probability at least $1 - \delta$, $\cos^2(\theta_{t+1}) \ge \cos^2(\theta_t)(1 + a_{10} \mu^2 \sin^2(\theta_t))$, where $a_{10}$ is some fixed constant.

In particular, when we initialize $u_0$ with a vector picked uniformly at random from a d-dimensional sphere, $\cos^2(\theta_0) \ge \frac{1}{d}$ with constant probability, and thus the number of samples required for success in the first round is $\tilde{O}(\frac{d}{\mu^4})$. This bound matches the lower bounds for learning mixtures of Gaussians in one dimension [Lin96], as well as conjectured lower bounds in experimental work [SSR06]. The following corollary summarizes the total number of samples required to learn the mixture with some fixed precision, for two identical spherical Gaussians with variance 1 and equal mixing weights.



Corollary 12. Let /x = \\^ l \\ = ||u 2 || < y m ^F- Suppose uq is chosen uniformly at random, and 
the number of rounds is N > Cq • ( w^^l ln(l+e) )' where Co is the fixed constant in Corollary 
// the number of samples \S\ is at least: N ' a9d ^°s ( rf ) then, with constant probability, after N 
rounds, cos 2 {9 n) > 1 — £• 



One can show a very similar corollary when $u_0$ is initialized as a random sample from the
mixture. We note that the total number of samples is a factor of iV ~ ^ times greater than the 
bound in Theorem 10. This is due to the fact that we use a fresh set of samples in every round,
in order to simplify our analysis. In practice, successive iterations of k-means or EM are run on the
same data-set.



Theorem 13 (Large $\mu^j$). Suppose that there exists some j such that $\|\mu^j\|/\sigma^j > \sqrt{\ln\frac{9}{2\pi}}$, and
suppose that the number of samples drawn in round t is at least

2 / da^ o- max + maXj H^ll 2 gg^maxj \\^\\ 2 + maxj |H[ 4 \ 

'" llo * ( '' M l ' ' ^^) + M 2 cos 2 (^)sin 4 (^) + Pmin< m sm 4 (0 t ) J 
for some constant an. U\ T t \ — y m 2%' f or a ^ 3> then, with probability at least 1 — 5 , cos 2 (9t+i) >
cos 2 (#t)(l + a\2 min(l, M 2 + MV) sin 2 (#t)); otherwise, with probability at least 1 — 5, cos 2 (#t+i) > 
cos 2 (#t)(l + ai3 Pn ^-2 A ^ mi " ta ° ); where an and ai3 are fixed constants. 

'P min^min 

For a mixture of two identical Gaussians with equal mixing weights and standard deviation 1, 
our result implies: 



Corollary 14. Suppose that fi = Wfi 1 ]] = \\^ 2 \\ > i/ln^- ; and suppose that the number of samples 



in round t is at least: aii log, 2 (d/5) ( 9 d A , n , H — 5 — . 4//1 . ) , for some constant an. If < 



ln^r, then, with probability at least 1 — 5, cos 2 (9t+i) > cos 2 (9t)(l + ai2 sin 2 (0 t )); otherwise, with 
probability 1 — 5, cos 2 (^+i) > cos 2 (^)(l + 013 tan 2 (#t)), where ai2 and 013 are fixed constants. 

Again, if we pick $u_0$ uniformly at random, we require about $\tilde{\Omega}(\frac{d}{\mu^2})$ samples for the first round to succeed. When $\mu > 1$, this bound is worse than $\frac{d}{\mu^4}$, but matches the upper bounds of [BCOFZ07]. The following corollary shows the number of samples required in total for 2-means-iterate to converge.



Corollary 15. Let /1 > yln^. Suppose uq is chosen uniformly at random and the number of 
rounds is N > Cq ■ (hid + i^jq^y); where Cq is the constant in Corollary® If \S\ is at least 
2NCo ^2°F ^ 1 then, with constant probability, after N rounds, cos 2 (#tv) > 1 — e. 

5 Lower Bounds 

In this section, we prove a lower bound on the sample complexity of learning mixtures of Gaussians, using Fano's Inequality [Yu97, CT05], stated in Theorem 19. Our main theorem in this section can be summarized as follows.

Theorem 16. Suppose we are given samples from the mixture $D(\mu) = \frac{1}{2} N(\mu, I_d) + \frac{1}{2} N(-\mu, I_d)$, for some $\mu$, and let $\hat{\mu}$ be the estimate of $\mu$ computed from n samples. If $n < \frac{C d}{\|\mu\|^2}$ for some constant C, and $\|\mu\| > 1$, then, there exists $\mu$ such that $E_{D(\mu)} \|\mu - \hat{\mu}\| \ge C' \|\mu\|$, where $C'$ is a constant.

The main tools in the proof of Theorem 16 are the following lemmas, and a generalized version of Fano's Inequality [CT05, Yu97].

Lemma 17. Let $\mu_1, \mu_2 \in \mathbb{R}^d$, and let $D_1$ and $D_2$ be the following mixture distributions: $D_1 = \frac{1}{2} N(\mu_1, I_d) + \frac{1}{2} N(-\mu_1, I_d)$, and $D_2 = \frac{1}{2} N(\mu_2, I_d) + \frac{1}{2} N(-\mu_2, I_d)$. Then,



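$$\mathrm{KL}(D_1, D_2) \leq \frac{1}{\sqrt{2\pi}} \left( \|\mu_2\|^2 - \|\mu_1\|^2 + \frac{3\sqrt{2\pi}}{2}\ln 2 + 2\|\mu_1\|\left(e^{-\|\mu_1\|^2/2} + \sqrt{2\pi}\,\|\mu_1\|\,\Phi(0, \|\mu_1\|)\right) \right)$$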



Lemma 18. There exists a set of vectors V = {vi, . . . ,vk} in R d with the following properties: 
(1) For each i and j, d(vi,Vj) > \,d{vi,-Vj) > J. (2) K = e d / w . (3) For all i, \\vi\\ < J\. 
Theorem 19 (Fano's Inequality). Consider a class of densities $F$, which contains $r$ densities 
$f_1, \ldots, f_r$, corresponding to parameter values $\theta_1, \ldots, \theta_r$. Let $d(\cdot)$ be any metric on $\theta$, and let $\hat{\theta}$ be 
an estimate of $\theta$ from $n$ samples from a density $f$ in $F$. If, for all $i$ and $j$, $d(\theta_i, \theta_j) \geq \alpha$, and 
$\mathrm{KL}(f_i, f_j) \leq \beta$, then $\max_j E_j\, d(\hat{\theta}, \theta_j) \geq \frac{\alpha}{2}\left(1 - \frac{n\beta + \ln 2}{\ln r}\right)$, where $E_j$ denotes the expectation with 
respect to distribution $j$. 

Proof. (Of Theorem 16) We apply Fano's Inequality. Our class of densities $F$ is the class of all 
mixtures of the form $\frac{1}{2}\mathcal{N}(\mu', I_d) + \frac{1}{2}\mathcal{N}(-\mu', I_d)$. We set the parameter $\theta = \mu'$, and $d(\mu_1, \mu_2) = \|\mu_1 - \mu_2\|$. 
We construct a subclass $F' = \{f_1, \ldots, f_r\}$ of $F$ as follows. We set each $f_i = \frac{1}{2}\mathcal{N}(\|\mu\| v_i, I_d) + 
\frac{1}{2}\mathcal{N}(-\|\mu\| v_i, I_d)$, for each vector $v_i$ in the set $V$ of Lemma 18. Notice that now $r = e^{d/10}$. Moreover, for 
each pair $i$ and $j$, from Lemma 17 and Lemma 18, $\mathrm{KL}(f_i, f_j) \leq C_1\|\mu\|^2 + C_2$, for constants $C_1$ 
and $C_2$. Finally, from Lemma 18, for each pair $i$ and $j$, $d(\mu_i, \mu_j) = \|\mu\| \cdot d(v_i, v_j)$ is at least a constant 
multiple of $\|\mu\|$. The Theorem now follows by an application of Fano's Inequality (Theorem 19). 
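
To see how the quantities in this proof interact, the following Python sketch evaluates the right-hand side of Fano's Inequality for a packing of size $r = e^{d/10}$. It assumes the bound has the form $\frac{\alpha}{2}(1 - \frac{n\beta + \ln 2}{\ln r})$; the function names and the numeric values chosen for `alpha`, `C1` and `C2` are hypothetical placeholders, not constants from the paper.

```python
import math

def fano_lower_bound(alpha, beta, n, log_r):
    """Evaluate (alpha / 2) * (1 - (n * beta + ln 2) / ln r).

    log_r is ln r, passed directly so that r = e^{d/10} never overflows.
    """
    return 0.5 * alpha * (1.0 - (n * beta + math.log(2.0)) / log_r)

def max_samples_with_positive_bound(beta, log_r):
    """Real-valued n at which the bound above reaches zero."""
    return (log_r - math.log(2.0)) / beta

# Hypothetical illustration values (assumptions, not taken from the paper).
d = 1000
mu_norm = 2.0
C1, C2 = 1.0, 1.0            # placeholder constants in KL <= C1 * ||mu||^2 + C2
beta = C1 * mu_norm ** 2 + C2
alpha = 0.4 * mu_norm        # placeholder: separation proportional to ||mu||
log_r = d / 10.0             # ln r for r = e^{d/10}

n_star = max_samples_with_positive_bound(beta, log_r)
```

Under these assumptions `n_star` grows linearly in `d` and shrinks as `beta` (and hence $\|\mu\|^2$) grows, which is the regime in which the lower bound on the estimation error stays positive.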



6 More General k-means 

In this section, we show that when we apply 2-means on an input generated by a mixture of $k$ 
spherical Gaussians, the normal to the hyperplane which partitions the two clusters in the 2-means 
algorithm converges to a vector in the subspace $\mathcal{M}$ containing the means of the mixture components. 
We assume that our input is generated by a mixture of $k$ spherical Gaussians, with means $\mu^j$, 
variances $(\sigma^j)^2$, $j = 1, \ldots, k$, and mixing weights $\rho^1, \ldots, \rho^k$. The mixture is centered at the origin 
such that $\sum_j \rho^j \mu^j = 0$. We use $\mathcal{M}$ to denote the subspace containing the means $\mu^1, \ldots, \mu^k$. We use 
Algorithm 2-means-iterate on this input, and our goal is to show that it still converges to a vector 
in $\mathcal{M}$. 

Notation. In the sequel, given a vector $x$ and a subspace $W$, we define the angle between $x$ and $W$ 
as the angle between $x$ and the projection of $x$ onto $W$. We examine the angle $\theta_t$ between $u_t$ and 
$\mathcal{M}$, and our goal is to show that the cosine of this angle grows as $t$ increases. Our main result of 
this section is Lemma 20, which exactly defines the behavior of 2-means-iterate on a mixture of $k$ 
spherical Gaussians. Recall that at time $t$, we use $u_t$ to partition the input data, and the projection 
of $u_t$ along $\mathcal{M}$ has norm $\cos(\theta_t)$ by definition. Let $b_t^1$ be a unit vector lying in the subspace $\mathcal{M}$ such that 
$u_t = \cos(\theta_t) b_t^1 + \sin(\theta_t) v_t$, where $v_t$ lies in the orthogonal complement of $\mathcal{M}$ and has norm 1. We 
define a second vector $u_t^{\perp}$ as follows: $u_t^{\perp} = \sin(\theta_t) b_t^1 - \cos(\theta_t) v_t$. We observe that $\langle u_t, u_t^{\perp} \rangle = 0$, 
$\|u_t^{\perp}\| = 1$, and the projection of $u_t^{\perp}$ on $\mathcal{M}$ is $\sin(\theta_t) b_t^1$. We now extend the set $\{b_t^1\}$ to complete 
an orthonormal basis $B = \{b_t^1, \ldots, b_t^{k-1}\}$ of $\mathcal{M}$. We also observe that $\{b_t^2, \ldots, b_t^{k-1}, u_t, u_t^{\perp}\}$ is an 
orthonormal basis of the subspace spanned by any basis of $\mathcal{M}$, along with $v_t$, and can be extended 
to a basis of $\mathbb{R}^d$. 

For $j = 1, \ldots, k$, we define $\tau_t^j$ as follows: $\tau_t^j = \langle \mu^j, u_t \rangle = \cos(\theta_t) \langle \mu^j, b_t^1 \rangle$. Finally we (re)-define 
the quantity $\xi_t$, and define $m_t^l$, for $l = 1, \ldots, k-1$, as 

$$\xi_t = \sum_j \rho^j \sigma^j \frac{e^{-(\tau_t^j)^2 / 2(\sigma^j)^2}}{\sqrt{2\pi}}, \qquad m_t^l = \sum_j \rho^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) \langle \mu^j, b_t^l \rangle$$

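Our main lemma is stated below. The proof is in the Appendix. 

Lemma 20. At any iteration $t$ of Algorithm 2-means-iterate, 

$$\cos^2(\theta_{t+1}) = \cos^2(\theta_t) \left( 1 + \tan^2(\theta_t)\, \frac{2\cos(\theta_t)\, \xi_t m_t^1 + \sum_l (m_t^l)^2}{\xi_t^2 + 2\cos(\theta_t)\, \xi_t m_t^1 + \sum_l (m_t^l)^2} \right)$$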
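
The update in Lemma 20 can be checked numerically. The Python sketch below evaluates $\cos^2(\theta_t)\big(1 + \tan^2(\theta_t)\,\frac{2\cos(\theta_t)\xi_t m_t^1 + \sum_l (m_t^l)^2}{\xi_t^2 + 2\cos(\theta_t)\xi_t m_t^1 + \sum_l (m_t^l)^2}\big)$ and compares it with a direct computation of the angle between $E[x \cdot 1_{\langle x, u_t\rangle > 0}]$ and $\mathcal{M}$. The direct computation relies on the standard truncated-Gaussian identity $E[x \cdot 1_{\langle x,u\rangle>0}] = \Phi(-\tau/\sigma,\infty)\,\mu + \sigma\frac{e^{-\tau^2/2\sigma^2}}{\sqrt{2\pi}}\,u$ for $x \sim N(\mu, \sigma^2 I)$ and a unit vector $u$. The three-component mixture in $\mathbb{R}^4$, the angles, and all function names are illustrative assumptions, not taken from the paper.

```python
import math

def upper_tail(a):
    """Phi(a, infinity): probability that a standard normal exceeds a."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def xi_and_m(u, basis, means, sigmas, weights):
    """Return xi_t and the list [m_t^1, ..., m_t^{k-1}] for the direction u."""
    xi = 0.0
    m = [0.0] * len(basis)
    for mu, sigma, rho in zip(means, sigmas, weights):
        tau = dot(mu, u)
        xi += rho * sigma * math.exp(-tau ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi)
        tail = upper_tail(-tau / sigma)
        for l, b in enumerate(basis):
            m[l] += rho * tail * dot(mu, b)
    return xi, m

def lemma20_cos2(theta, xi, m):
    """Right-hand side of Lemma 20; m[0] plays the role of m_t^1."""
    s = sum(v * v for v in m)
    cross = 2 * math.cos(theta) * xi * m[0]
    return math.cos(theta) ** 2 * (1 + math.tan(theta) ** 2 * (cross + s) / (xi ** 2 + cross + s))

def direct_cos2(u, basis, means, sigmas, weights):
    """cos^2 of the angle between E[x * 1{<x,u> > 0}] and span(basis)."""
    xi, _ = xi_and_m(u, basis, means, sigmas, weights)
    w = [xi * c for c in u]
    for mu, sigma, rho in zip(means, sigmas, weights):
        tail = upper_tail(-dot(mu, u) / sigma)
        w = [wc + rho * tail * c for wc, c in zip(w, mu)]
    return sum(dot(w, b) ** 2 for b in basis) / dot(w, w)

# Illustrative centered mixture: sum_j rho^j mu^j = 0, means in span(e1, e2).
weights = [0.5, 0.3, 0.2]
sigmas = [1.0, 0.5, 2.0]          # standard deviations sigma^j
means = [[2.0, 0.0, 0.0, 0.0], [-1.0, 3.0, 0.0, 0.0], [-3.5, -4.5, 0.0, 0.0]]

theta, alpha = 1.2, 0.7
b1 = [math.cos(alpha), math.sin(alpha), 0.0, 0.0]
b2 = [-math.sin(alpha), math.cos(alpha), 0.0, 0.0]
v = [0.0, 0.0, 1.0, 0.0]          # unit vector orthogonal to the subspace of means
u = [math.cos(theta) * b + math.sin(theta) * c for b, c in zip(b1, v)]

xi, m = xi_and_m(u, [b1, b2], means, sigmas, weights)
predicted = lemma20_cos2(theta, xi, m)
direct = direct_cos2(u, [b1, b2], means, sigmas, weights)
```

In this sketch `predicted` and `direct` agree up to floating-point error, and both exceed $\cos^2(\theta_t)$.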
Appendix 

6.1 Proof of Lemma 1 

In this section, we prove Lemma 1. First, we need some additional notation. 
Notation. We define, for $j = 1, 2$: 

$$w_{t+1}^j = \Pr[x \sim D_j \mid x \in C_{t+1}]$$
$$u_{t+1}^j = E[x \mid x \sim D_j, x \in C_{t+1}]$$

We observe that $u_{t+1}$ can now be written as: 

$$u_{t+1} = w_{t+1}^1 u_{t+1}^1 + w_{t+1}^2 u_{t+1}^2$$

Moreover, we define $Z_{t+1} = \Pr[x \in C_{t+1}]$. 

Proof of Lemma 1. We start by providing exact expressions for $w_{t+1}^1$ and $w_{t+1}^2$ with respect to 
the partition computed in the previous round $t$. These are used to compute the projections of $u_{t+1}$ 
along the vectors $u_t$ and $\mu^1 - \langle \mu^1, u_t \rangle u_t$, which finally leads to a proof of Lemma 1. 



Lemma 21. In round $t$, for $j = 1, 2$, 

$$w_{t+1}^j = \frac{\rho^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)}{Z_{t+1}}$$

Proof. We can write: 

$$w_{t+1}^j = \frac{\Pr[x \in C_{t+1} \mid x \sim D_j]\, \Pr[x \sim D_j]}{\Pr[x \in C_{t+1}]}$$

We note that $\Pr[x \sim D_j] = \rho^j$, and $\Pr[x \in C_{t+1}] = Z_{t+1}$. 

As $D_j$ is a spherical Gaussian, for any $x$ generated from $D_j$, and for any vector $y$ orthogonal to 
$u_t$, $\langle y, x \rangle$ is distributed independently from $\langle u_t, x \rangle$. Moreover, we observe that $\langle u_t, x \rangle$ is distributed 
as a Gaussian with mean $\langle \mu^j, u_t \rangle = \tau_t^j$ and standard deviation $\sigma^j$. Therefore, 

$$\Pr[x \in C_{t+1} \mid x \sim D_j] = \Pr[\langle u_t, x \rangle > 0] = \Pr[N(\tau_t^j, \sigma^j) > 0] = \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)$$

from which the lemma follows. 
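
As a small illustration of Lemma 21, the following Python sketch computes the weights $w_{t+1}^j = \rho^j \Phi(-\tau_t^j/\sigma^j, \infty)/Z_{t+1}$ from given projections $\tau_t^j$, with $Z_{t+1}$ taken to be the sum of the numerators. The function names and the example values are assumptions chosen only for illustration.

```python
import math

def upper_tail(a):
    """Phi(a, infinity): probability that a standard normal exceeds a."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def cluster_weights(taus, sigmas, rhos):
    """Return (w, Z) with w[j] = rho_j * Phi(-tau_j / sigma_j, inf) / Z."""
    masses = [rho * upper_tail(-tau / sigma)
              for tau, sigma, rho in zip(taus, sigmas, rhos)]
    Z = sum(masses)                      # Z_{t+1} = Pr[x in C_{t+1}]
    return [mass / Z for mass in masses], Z

# Symmetric two-component example: tau^1 = -tau^2, equal weights, unit variance.
w, Z = cluster_weights([0.8, -0.8], [1.0, 1.0], [0.5, 0.5])
```

For a symmetric mixture the half-space always captures half of the probability mass, so `Z` is 0.5 here, and the component whose mean lies on the positive side of $u_t$ receives the larger weight.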

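Lemma 22. For any $t$, 

$$\langle u_{t+1}, u_t \rangle = \frac{\xi_t + m_t \cos(\theta_t)}{Z_{t+1}}$$

Proof. Consider a sample $x$ drawn from $D_j$. Then, $\langle x, u_t \rangle$ is distributed as a Gaussian with mean 
$\langle \mu^j, u_t \rangle = \tau_t^j$ and standard deviation $\sigma^j$. We recall that $\Pr[x \in C_{t+1}] = Z_{t+1}$. Therefore, $\langle u_{t+1}^j, u_t \rangle$ 
is equal to: 

$$\frac{E[\langle x, u_t \rangle \cdot 1_{x \in C_{t+1}} \mid x \sim D_j]}{\Pr[x \in C_{t+1} \mid x \sim D_j]} = \frac{1}{\Pr[N(\tau_t^j, \sigma^j) > 0]} \int_{y=0}^{\infty} \frac{y\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy$$

which is, again, equal to: 

$$\frac{1}{\Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)} \left( \tau_t^j \int_{y=0}^{\infty} \frac{e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy + \int_{y=0}^{\infty} \frac{(y - \tau_t^j)\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy \right)$$

$$= \frac{1}{\Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)} \left( \tau_t^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) + \int_{y=0}^{\infty} \frac{(y - \tau_t^j)\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy \right)$$

We can compute the integral in the equation above as follows. 

$$\int_{y=0}^{\infty} (y - \tau_t^j)\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}\, dy = (\sigma^j)^2 \int_{z = (\tau_t^j)^2 / 2(\sigma^j)^2}^{\infty} e^{-z}\, dz = (\sigma^j)^2 e^{-(\tau_t^j)^2 / 2(\sigma^j)^2}$$

We can now compute $\langle u_{t+1}, u_t \rangle$ as follows. 

$$\langle u_{t+1}, u_t \rangle = w_{t+1}^1 \langle u_{t+1}^1, u_t \rangle + w_{t+1}^2 \langle u_{t+1}^2, u_t \rangle = \frac{1}{Z_{t+1}} \sum_j \left( \rho^j \tau_t^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) + \rho^j \sigma^j \frac{e^{-(\tau_t^j)^2 / 2(\sigma^j)^2}}{\sqrt{2\pi}} \right)$$

The lemma follows by recalling $\tau_t^j = \langle \mu^j, b \rangle \cos(\theta_t)$ and plugging in the values of $m_t$ and $\xi_t$. 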
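
The key step in this proof is the mean of a Gaussian $N(\tau, \sigma^2)$ conditioned on being positive, which works out to $\tau + \sigma \frac{e^{-\tau^2/2\sigma^2}}{\sqrt{2\pi}} / \Phi(-\tau/\sigma, \infty)$. The Python sketch below compares this closed form with a plain midpoint-rule integration. The function names, the integration cutoff of twelve standard deviations, and the step count are assumptions made for illustration.

```python
import math

def upper_tail(a):
    """Phi(a, infinity): probability that a standard normal exceeds a."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def truncated_mean_closed_form(tau, sigma):
    """E[y | y > 0] for y ~ N(tau, sigma^2), using the closed form above."""
    boundary_term = sigma * math.exp(-tau ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi)
    return tau + boundary_term / upper_tail(-tau / sigma)

def truncated_mean_numeric(tau, sigma, steps=200000):
    """Midpoint-rule estimate of the same conditional mean on [0, cutoff]."""
    cutoff = max(tau, 0.0) + 12.0 * sigma   # mass beyond this point is negligible
    h = cutoff / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        p = math.exp(-(y - tau) ** 2 / (2 * sigma ** 2))  # unnormalized density
        numerator += y * p
        denominator += p
    return numerator / denominator
```

The normalizing constant of the density cancels in the ratio, so the numeric routine never needs it.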

Lemma 23. Let $v_t$ be a unit vector along $\mu^1 - \langle \mu^1, u_t \rangle u_t$. Then, $\langle u_{t+1}, v_t \rangle = \frac{m_t \sin(\theta_t)}{Z_{t+1}}$. In addition, 
for any vector $z$ orthogonal to $u_t$ and $v_t$, $\langle u_{t+1}, z \rangle = 0$. 

Proof. We observe that for a sample $x$ drawn from distribution $D_1$ (respectively, $D_2$) and any 
unit vector $v_t$ orthogonal to $u_t$, $\langle x, v_t \rangle$ is distributed as a Gaussian with mean $\langle \mu^1, v_t \rangle$ ($\langle \mu^2, v_t \rangle$, 
respectively) and standard deviation $\sigma^1$ (resp. $\sigma^2$). Therefore, the projection of $u_{t+1}$ on $v_t$ can be 
written as: 

$$\langle u_{t+1}, v_t \rangle = \sum_j w_{t+1}^j \langle \mu^j, v_t \rangle = \frac{1}{Z_{t+1}} \sum_j \rho^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) \langle \mu^j, v_t \rangle$$

from which the first part of the lemma follows. 

The second part of the lemma follows from the observation that for any vector $z$ orthogonal to 
$u_t$ and $v_t$, $\langle \mu^j, z \rangle = 0$, for $j = 1, 2$. 

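Lemma 24. For any $t$, 

$$\langle u_{t+1}, \mu^1 \rangle = \frac{\|\mu^1\| (\xi_t \cos(\theta_t) + m_t)}{Z_{t+1}}$$

$$\|u_{t+1}\|^2 = \frac{\xi_t^2 + m_t^2 + 2 \xi_t m_t \cos(\theta_t)}{Z_{t+1}^2}$$

Proof. As we have an infinite number of samples, $u_{t+1}$ lies in the same plane as $u_t$ and $\mu^1$. Therefore, 
we can write $\langle u_{t+1}, \mu^1 \rangle = \langle u_{t+1}, u_t \rangle \langle \mu^1, u_t \rangle + \langle u_{t+1}, v_t \rangle \langle \mu^1, v_t \rangle$. Moreover, we can write $\|u_{t+1}\|^2 = 
\langle u_{t+1}, u_t \rangle^2 + \langle u_{t+1}, v_t \rangle^2$. Thus, the two equations follow by using Lemmas 22 and 23, and recalling 
that $\langle \mu^1, u_t \rangle = \tau_t^1 = \|\mu^1\| \cos(\theta_t)$ and $\langle \mu^1, v_t \rangle = \|\mu^1\| \sin(\theta_t)$. 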

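We are now ready to complete the proof of Lemma 1. 
Proof. (Of Lemma 1) By definition of $\theta_{t+1}$, $\cos^2(\theta_{t+1}) = \frac{\langle u_{t+1}, \mu^1 \rangle^2}{\|\mu^1\|^2 \|u_{t+1}\|^2}$. Therefore, 

$$\|\mu^1\|^2 \cos^2(\theta_{t+1}) = \frac{\langle u_{t+1}, \mu^1 \rangle^2}{\|u_{t+1}\|^2}$$

$$= \|\mu^1\|^2 \cos^2(\theta_t) \left( 1 + \frac{\langle u_{t+1}, \mu^1 \rangle^2 - \|\mu^1\|^2 \cos^2(\theta_t) \|u_{t+1}\|^2}{\|\mu^1\|^2 \cos^2(\theta_t) \|u_{t+1}\|^2} \right)$$

$$= \|\mu^1\|^2 \cos^2(\theta_t) \left( 1 + \frac{\|\mu^1\|^2 \sin^2(\theta_t) (m_t^2 + 2 \xi_t m_t \cos(\theta_t))}{Z_{t+1}^2 \|\mu^1\|^2 \cos^2(\theta_t) \|u_{t+1}\|^2} \right)$$

$$= \|\mu^1\|^2 \cos^2(\theta_t) \left( 1 + \tan^2(\theta_t)\, \frac{m_t^2 + 2 \xi_t m_t \cos(\theta_t)}{Z_{t+1}^2 \|u_{t+1}\|^2} \right)$$

where we used Lemma 24 and the observation that $\cos(\theta_t) = \frac{\tau_t^1}{\|\mu^1\|}$. The Lemma follows by replacing 
$\|u_{t+1}\|^2$ using the expression in Lemma 24. 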
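
To illustrate how the infinite-sample recurrence behaves, the following Python sketch iterates $\cos^2(\theta_{t+1}) = \cos^2(\theta_t)\big(1 + \tan^2(\theta_t)\,\frac{2\cos(\theta_t)\xi_t m_t + m_t^2}{\xi_t^2 + 2\cos(\theta_t)\xi_t m_t + m_t^2}\big)$ for the special case of two identical unit-variance Gaussians with equal weights and means $\pm\mu b$. In that case $\xi_t = e^{-\tau^2/2}/\sqrt{2\pi}$ and $m_t = \mu\,\Phi(0, \tau)$ with $\tau = \mu\cos(\theta_t)$. The function names, the starting value, and the choice $\mu = 2$ are assumptions made for illustration.

```python
import math

def central_mass(a):
    """Phi(0, a): probability that a standard normal lies in [0, a]."""
    return 0.5 * math.erf(a / math.sqrt(2.0))

def next_cos2(c2, mu):
    """One infinite-sample round for 0.5*N(mu*b, I) + 0.5*N(-mu*b, I).

    c2 is cos^2(theta_t) and must be strictly positive.
    """
    cos_t = math.sqrt(c2)
    tau = mu * cos_t
    xi = math.exp(-tau ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    m = mu * central_mass(tau)
    cross = 2.0 * cos_t * xi * m
    gain = (cross + m ** 2) / (xi ** 2 + cross + m ** 2)
    tan2 = (1.0 - c2) / c2
    # Clamp to guard against rounding slightly above 1.
    return min(1.0, c2 * (1.0 + tan2 * gain))

def trajectory(c2_start, mu, rounds):
    """Return [cos^2(theta_0), ..., cos^2(theta_rounds)]."""
    values = [c2_start]
    for _ in range(rounds):
        values.append(next_cos2(values[-1], mu))
    return values

path = trajectory(0.005, 2.0, 30)
```

In this sketch the sequence in `path` is nondecreasing and approaches 1, in line with the claim that the cosine of the angle grows from round to round.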



The next Lemma helps us to derive Theorem 2 from Lemma 1. It shows how to approximate 
$\Phi(-\tau, \tau)$ when $\tau$ is small. 



2e-- 2 / 2 ^ 2 



Lemma 25. Let r < \ hx-Z-. Then, -j=t < $(-t,t) < -4=r. In addition, 2e ' > 
6.2 Proofs of Sample Requirement Bounds 

For the rest of the section, we prove Lemmas 8 and 9, which lead to a proof of Lemma 7. First, 
we need to define some notation. 

Notation. At time $t$, we use the notation $S_{t+1}$ to denote the quantity $E[X \cdot 1_{X \in C_{t+1}}]$, where 
$1_{X \in C_{t+1}}$ is the indicator function for the event $X \in C_{t+1}$, and the expectation is taken over the 
entire mixture. 

In the sequel, we also use the notation $\hat{S}_{t+1}$ to denote the empirical value of $S_{t+1}$. Our goal is 
to bound the concentration of certain functions of $\hat{S}_{t+1}$ around their expected values, when we are 
given only $n$ samples from the mixture. Recall that we define $\theta_{t+1}$ as the angle between $\mu^1$ and the 
hyperplane separator in 2-means-iterate, given $\theta_t$. Notice that now $\theta_t$ is a random variable, which 
depends on the samples drawn in rounds $1, \ldots, t-1$, and given $\theta_t$, $\theta_{t+1}$ is a random variable, whose 
value depends on samples in round $t$. Also we use $u_{t+1}$ as the center of partition $C_t$ in iteration 
$t + 1$, and $E[u_{t+1}]$ is the expected center. Note that all the expectations in round $t$ are conditioned 
on $\theta_t$. 
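
The gap between the empirical and the expected update can be observed in a small simulation. The Python sketch below draws $n$ samples from a mixture of two identical unit-variance Gaussians with means $\pm\mu e_1$, forms the empirical center of the half-space $\{x : \langle x, u_t \rangle > 0\}$, and compares the resulting $\cos^2$ with the infinite-sample prediction for this symmetric case. The dimension, sample size, seed, starting direction, and function names are all illustrative assumptions; this is a sketch of one round, not the paper's Algorithm 2-means-iterate in full.

```python
import math
import random

def predicted_cos2(c2, mu):
    """Infinite-sample update for the mixture 0.5*N(mu*e1, I) + 0.5*N(-mu*e1, I)."""
    cos_t = math.sqrt(c2)
    tau = mu * cos_t
    xi = math.exp(-tau ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    m = mu * 0.5 * math.erf(tau / math.sqrt(2.0))   # mu * Phi(0, tau)
    cross = 2.0 * cos_t * xi * m
    return c2 * (1.0 + ((1.0 - c2) / c2) * (cross + m ** 2) / (xi ** 2 + cross + m ** 2))

def empirical_center(u, mu, n, rng):
    """Mean of the samples that fall in the half-space <x, u> > 0."""
    d = len(u)
    total = [0.0] * d
    count = 0
    for _ in range(n):
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += sign * mu                      # component mean is +/- mu * e1
        if sum(a * b for a, b in zip(x, u)) > 0.0:
            total = [t + a for t, a in zip(total, x)]
            count += 1
    return [t / count for t in total]

rng = random.Random(0)                         # fixed seed for repeatability
d, mu, n = 5, 2.0, 50000
start_c2 = 0.09
u = [math.sqrt(start_c2), math.sqrt(1.0 - start_c2)] + [0.0] * (d - 2)

center = empirical_center(u, mu, n, rng)
empirical = center[0] ** 2 / sum(c * c for c in center)
predicted = predicted_cos2(start_c2, mu)
```

With this many samples the empirical value lands close to the prediction; shrinking `n` or increasing `d` widens the gap, which is the effect the sample requirement bounds quantify.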

Proofs. We are now ready to prove Lemmas 8 and 9. 

Proof. (Of Lemma 8) Let $X_1, \ldots, X_n$ be the $n$ iid samples from the mixture; for each $i$, we can 
write the projection of $X_i$ along $v$ as follows: 

$$\langle X_i, v \rangle = Y_i + Z_i$$

where $Z_i \sim N(0, \sigma^j)$, if $X_i$ is generated from distribution $D_j$, and $Y_i = \langle \mu^j, v \rangle$, if $X_i$ is generated 
by $D_j$. Therefore, we can write: 

$$\langle \hat{S}_{t+1}, v \rangle = \frac{1}{n} \left( \sum_i Y_i \cdot 1_{X_i \in C_{t+1}} + \sum_i Z_i \cdot 1_{X_i \in C_{t+1}} \right)$$

To determine the concentration of $\langle \hat{S}_{t+1}, v \rangle$ around its expected value, we address the two terms 
separately. 

The first term is a sum of n independently distributed random variables, such that changing 
one variable changes the sum by at most maxj 2 ^'"^ ; therefore, to calculate its concentration, 
one can apply Hoeffding's Inequality. It follows that with probability at most |, 

- > Yi ■ lXieCt+i ~ E - > Yi ■ lx ie c t+1 J| > max — 

n n 3 vn 

ii 

We note that, in the second term, each $Z_i$ is a Gaussian with mean 0 and standard deviation $\sigma^j$, scaled 
by $\|v\|$. For some $0 < \delta' < 1$, let $E_i(\delta')$ denote the event 

$$-\sigma_{\max} \|v\| \sqrt{2\log(1/\delta')} \leq Z_i \cdot 1_{X_i \in C_{t+1}} \leq \sigma_{\max} \|v\| \sqrt{2\log(1/\delta')}$$

As $Z_i \sim N(0, \sigma^j)$, if $X_i$ is generated from distribution $D_j$, and $1_{X_i \in C_{t+1}}$ takes values 0 and 1, for 
any $i$, for $\delta'$ small enough, $\Pr[E_i(\delta')] \geq 1 - \delta'$. 

We use $\delta' = \frac{\delta}{4n}$ and condition on the fact that all the events $\{E_i(\delta'), i = 1, \ldots, n\}$ happen; 
using a Union bound over the events $E_i(\delta')$, the probability that this holds is at least $1 - \frac{\delta}{4}$. We 
also observe that, as the Gaussians $Z_i$ are independently distributed, conditioned on all of 
the events $E_i$ holding, the Gaussians $Z_i$ are still independent. Therefore, conditioned on the event $\cap_i E_i(\delta')$, 
$\frac{1}{n} \sum_i Z_i \cdot 1_{X_i \in C_{t+1}}$ is the sum of $n$ independent random variables, such that changing one variable 
changes the sum by at most $\frac{2 \sigma_{\max} \|v\| \sqrt{2\log(1/\delta')}}{n}$. We 

can now apply Hoeffding's bound to conclude 

that with probability at least 1 



2' 



\ l X- 7 ^ .^71 II <r 4cT m ax|bllv / 21og(l/^)v/21og(l/^) 8<r max |H| log(4n/g) 



n <■ — ' n <■ — ' ^Jn \ln 

ii 

The lemma now follows by applying a union bound. 

Proof. (Of Lemma 9) We can write: 

$$\|\hat{S}_{t+1}\|^2 \leq \|S_{t+1}\|^2 + \|\hat{S}_{t+1} - S_{t+1}\|^2 + 2 |\langle \hat{S}_{t+1} - S_{t+1}, S_{t+1} \rangle|$$

If $v_1, \ldots, v_d$ is any orthonormal basis of $\mathbb{R}^d$, then, we can bound the second term as follows. 
With probability at least 1 — |, 

d 100 1„J/ 



ic c i|2 S^t/a a / 12 8log (8n/£) ^ 2 M M 2 . V^/ j \2n 

\S t+ l ~ S t+ l\\ = 2^(\ S t+l- S t+l,Vi)) < (^^maxINI +2^^,«i>) 



The second step follows by the application of Lemma 8 and the fact that for any $a$ and $b$, 
$(a + b)^2 \leq 2(a^2 + b^2)$. 

2' 



Using Lemma [H with probability at least 1 



. * 81og(8n/5). .. ., .. 

{St+l - S t+ l,S t+ l) < ■== (T m ax||^+l|| +max|(6i + i,/i' y )|) 



The lemma follows by a union bound over the two events above. 

6.3 Proofs of Lower Bounds 

Proof. (Of Lemma 17) Let $P$ be the plane containing the origin $O$ and the vectors $\mu_1$ and $\mu_2$. If 
$v$ is a vector orthogonal to $P$, then the projection of $D_1$ along $v$ is a Gaussian $N(0, 1)$, which is 
distributed independently of the projection of $D_1$ along $P$ (and the same is the case for $D_2$). Therefore, 
to compute the KL-Divergence of $D_1$ and $D_2$, it is sufficient to compute the KL-Divergence of the 
projections of $D_1$ and $D_2$ along the plane $P$. 
Let $x$ be a vector in $P$. Then, 



1 f 1 ,,9,1 ,,9, / ip~ f-w / z + ± P - F-t-Mi / z \ 

KL(Di,D 2 ) = / (i e -ll^ill 2 /2 + i e -||x+ w || 2 /2 )ln 2^- „ 2/0 + f ^ 

V ^ V2^J xe p 2 2 ; y e -||x-M2|| 2 /2+ i e -lk+M2lP/2y 

V2^J X& P { 2 + 2 6 Jm l e -||x+M 2 H 2 /2.(i + e 2<x, M2 > ) ) ax 



i I, ( r-"- M " 2/2 + r-" I+wlP/2 » + "II" - II- + «ll 2 ) + - 
We observe that for any $x$, $\|x + \mu_2\|^2 - \|x + \mu_1\|^2 = \|\mu_2\|^2 - \|\mu_1\|^2 + 2\langle x, \mu_2 - \mu_1 \rangle$. As the 
expected value of $D_1$ is 0, we can write that: 

$$\int_{x \in P} \left( \tfrac{1}{2} e^{-\|x - \mu_1\|^2/2} + \tfrac{1}{2} e^{-\|x + \mu_1\|^2/2} \right) \langle x, \mu_2 - \mu_1 \rangle\, dx = E_{x \sim D_1} \langle x, \mu_2 - \mu_1 \rangle = 0 \qquad (1)$$

We now focus on the case where $\|\mu_1\| \gg 1$. We observe that for any $\mu_2$ and any $x$, $1 + e^{2\langle x, \mu_2 \rangle} \geq 
1$. Therefore, combining the previous two equations, 

$$\mathrm{KL}(D_1, D_2) \leq \frac{1}{\sqrt{2\pi}} \left( \|\mu_2\|^2 - \|\mu_1\|^2 + \int_{x \in P} \left( \tfrac{1}{2} e^{-\|x - \mu_1\|^2/2} + \tfrac{1}{2} e^{-\|x + \mu_1\|^2/2} \right) \ln\!\left(1 + e^{2\langle x, \mu_1 \rangle}\right) dx \right)$$

Again, since the projection of $D_1$ perpendicular to $\mu_1$ is distributed independently of the projection 
of $D_1$ along $\mu_1$, the above integral can be taken over a one-dimensional $x$ which varies along the 
vector $\mu_1$. For the rest of the proof, we abuse notation, and use $\mu_1$ to denote both the vector $\mu_1$ 
and the scalar $\|\mu_1\|$. We can write: 

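$$\int_{x=-\infty}^{\infty} \left( \tfrac{1}{2} e^{-(x - \mu_1)^2/2} + \tfrac{1}{2} e^{-(x + \mu_1)^2/2} \right) \ln\!\left(1 + e^{2\mu_1 x}\right) dx$$

$$\leq \sqrt{2\pi} \ln 2 + \int_{x=0}^{\infty} \left( \tfrac{1}{2} e^{-(x - \mu_1)^2/2} + \tfrac{1}{2} e^{-(x + \mu_1)^2/2} \right) \ln\!\left(1 + e^{2\mu_1 x}\right) dx$$

$$\leq \sqrt{2\pi} \ln 2 + \int_{x=0}^{\infty} \left( \tfrac{1}{2} e^{-(x - \mu_1)^2/2} + \tfrac{1}{2} e^{-(x + \mu_1)^2/2} \right) (\ln 2 + 2x\mu_1)\, dx$$

$$\leq \frac{3\sqrt{2\pi}}{2} \ln 2 + 2\mu_1 \int_{x=0}^{\infty} \left( \tfrac{1}{2} e^{-(x - \mu_1)^2/2} + \tfrac{1}{2} e^{-(x + \mu_1)^2/2} \right) x\, dx$$

The first part follows because for $x < 0$, $\ln(1 + e^{2x\mu_1}) \leq \ln 2$. The second part follows because for 
$x > 0$, $\ln(1 + e^{2x\mu_1}) \leq \ln(2e^{2x\mu_1})$. The third part follows from the symmetry of $D_1$ around the 
origin. 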

Now, for any a, we can write: 

-L xe-^ 2 ' 2 dx = -L • e~ a2 ' 2 - a$(a, oo) 
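Plugging this in, we can show that, 

$$\mathrm{KL}(D_1, D_2) \leq \frac{1}{\sqrt{2\pi}} \left( \|\mu_2\|^2 - \|\mu_1\|^2 + \frac{3\sqrt{2\pi}}{2} \ln 2 + 2\|\mu_1\| \left( e^{-\|\mu_1\|^2/2} + \sqrt{2\pi}\, \|\mu_1\|\, \Phi(0, \|\mu_1\|) \right) \right)$$

from which the lemma follows. 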

Proof. (Of Lemma \l~8\i For each i, let each «j be drawn independently from the distribution 
^=A/"(0, Jrf) . For each let Pij = | • d(vi,Vj) and iVJy = | • d(vi,—Vj). Then, for each i and 
j, and Nij are distributed according to the Chi-squared distribution with parameter d. From 
Lemma [261 it follows that: Pr[Pjj < ^] < e ~ 3 <Vio_ A similar lemma can also be shown to hold 
for the random variables Ny. Applying the Union Bound, the probability that this holds for P^ 
and for all pairs (i,j),i £ V,j £ V is at most 2K 2 e~ 3d ^ 10 . This probability is at most | when 
A' <■'■". 

In addition, we observe that for each vector $v_i$, $d \cdot \|v_i\|^2$ is also distributed as a Chi-squared 
distribution with parameter $d$. From Lemma 26, for each $i$, $\Pr[\|v_i\|^2 > 7/5] \leq e^{-2d/15}$. The second 
part of the lemma now follows by a Union Bound over all $K$ vectors in the set $V$. 
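
The random construction used in this proof is easy to simulate. The Python sketch below draws $K$ vectors independently from $\frac{1}{\sqrt{d}} N(0, I_d)$ and reports the smallest squared distance over all pairs (including antipodal pairs $v_i, -v_j$) and the largest squared norm. The values of `K`, `d`, the seed, and the function names are illustrative assumptions; in particular, `K` here is far smaller than the $e^{d/10}$ vectors the lemma allows.

```python
import math
import random

def random_packing(K, d, rng):
    """Draw K vectors independently from (1 / sqrt(d)) * N(0, I_d)."""
    scale = 1.0 / math.sqrt(d)
    return [[scale * rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

def packing_stats(V):
    """Return (min squared distance over pairs and antipodal pairs, max squared norm)."""
    min_sq_dist = float("inf")
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            same = sum((a - b) ** 2 for a, b in zip(V[i], V[j]))
            anti = sum((a + b) ** 2 for a, b in zip(V[i], V[j]))
            min_sq_dist = min(min_sq_dist, same, anti)
    max_sq_norm = max(sum(a * a for a in v) for v in V)
    return min_sq_dist, max_sq_norm

rng = random.Random(1)                 # fixed seed for repeatability
V = random_packing(30, 400, rng)
min_sq_dist, max_sq_norm = packing_stats(V)
```

In high dimension the squared norms concentrate near 1 and the squared pairwise distances near 2, so the sampled vectors are well separated while staying inside a ball of constant radius.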
Lemma 26. Let $X$ be a random variable, drawn from the Chi-squared distribution with parameter 
$d$. Then, 

$$\Pr\left[X < \frac{d}{10}\right] \leq e^{-3d/10}$$

Moreover, 

$$\Pr\left[X > \frac{7d}{5}\right] \leq e^{-2d/15}$$



Proof. Let $Y$ be the random variable defined as follows: $Y = d - X$. Then, 

$$\Pr\left[X < \frac{d}{10}\right] = \Pr\left[Y > \frac{9d}{10}\right] = \Pr\left[e^{tY} > e^{9dt/10}\right] \leq \frac{E[e^{tY}]}{e^{9dt/10}}$$

where the last step uses a Markov's Inequality. We observe that E[e iy ] = e td ~E[e~ tx ] = e td (l — 
2t) d l 2 , for i < \. The first part of the lemma follows from the observation that (1 — 2t) d l 2 < e~'"' 



and by plugging in t = g. 

For the second part, we again observe that 

Fr[X > ™\ < (1 - 2t )- d / 2 e- 7dt / 5 < e- 2d '/ 5 

The lemma now follows by plugging in t = |. 
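
The lower-tail bound $\Pr[X < d/10] \leq e^{-3d/10}$ can be checked exactly when $d$ is even, because a Chi-squared variable with $2m$ degrees of freedom satisfies $\Pr[X < x] = \Pr[\mathrm{Poisson}(x/2) \geq m]$. The Python sketch below sums that Poisson tail directly. It covers only the lower-tail bound, and the function name and the fixed number of series terms are assumptions made for illustration.

```python
import math

def chi2_lower_tail_even(x, d, terms=200):
    """Pr[X < x] for X ~ Chi-squared with an even number d of degrees of freedom.

    Uses Pr[X < x] = Pr[Poisson(x / 2) >= d / 2], summed term by term.
    Assumes x / 2 < d / 2 so the terms shrink geometrically.
    """
    if d <= 0 or d % 2 != 0:
        raise ValueError("d must be a positive even integer")
    lam = x / 2.0
    m = d // 2
    # First term: e^{-lam} * lam^m / m!, computed in log space for stability.
    term = math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))
    total = 0.0
    for i in range(m, m + terms):
        total += term
        term *= lam / (i + 1)
    return total

checks = {d: (chi2_lower_tail_even(d / 10.0, d), math.exp(-3.0 * d / 10.0))
          for d in (2, 10, 20, 100, 200)}
```

Each entry of `checks` pairs the exact probability with the claimed bound; for these values of `d` the exact probability is well below the bound.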

6.4 More General k-means: Results and Proofs 

In this section, we show that when we apply 2-means on an input generated by a mixture of $k$ 
spherical Gaussians, the normal to the hyperplane which partitions the two clusters in the 2-means 
algorithm converges to a vector in the subspace $\mathcal{M}$ containing the means of the mixture components. 
This subspace is interesting because, in this subspace, the distance between the means is as high 
as in the original space; however, if the number of clusters is small compared to the dimension, the 
distance between two samples from the same cluster is much smaller. In fact, several algorithms 
for learning mixture models [VW02, AM05, CR08] attempt to isolate this subspace first, and then 
use some simple clustering methods in this subspace. 



6.4.1 The Setting 

We assume that our input is generated by a mixture of $k$ spherical Gaussians, with means $\mu^j$, 
variances $(\sigma^j)^2$, $j = 1, \ldots, k$, and mixing weights $\rho^1, \ldots, \rho^k$. The mixture is centered at the origin 
such that $\sum_j \rho^j \mu^j = 0$. We use $\mathcal{M}$ to denote the subspace containing the means $\mu^1, \ldots, \mu^k$. 

We use Algorithm 2-means-iterate on this input, and our goal is to show that it still converges 
to a vector in $\mathcal{M}$. 

In the sequel, given a vector $x$ and a subspace $W$, we define the angle between $x$ and $W$ as the 
angle between $x$ and the projection of $x$ onto $W$. As in Sections 2 and 3, we examine the angle 
$\theta_t$ between $u_t$ and $\mathcal{M}$, and our goal is to show that the cosine of this angle grows as $t$ increases. 
Our main result of this section is Lemma 20, which, analogous to Lemma 1 in Section 3, exactly 
defines the behavior of 2-means on a mixture of $k$ spherical Gaussians. 

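Before we can prove the lemma, we need some additional notation. 

6.4.2 Notation 

Recall that at time $t$, we use $u_t$ to partition the input data, and the projection of $u_t$ along $\mathcal{M}$ has norm 
$\cos(\theta_t)$ by definition. Let $b_t^1$ be a unit vector lying in the subspace $\mathcal{M}$ such that: 

$$u_t = \cos(\theta_t) b_t^1 + \sin(\theta_t) v_t$$

where $v_t$ lies in the orthogonal complement of $\mathcal{M}$, and has norm 1. We define a second vector $u_t^{\perp}$ 
as follows: 

$$u_t^{\perp} = \sin(\theta_t) b_t^1 - \cos(\theta_t) v_t$$

We observe that $\langle u_t, u_t^{\perp} \rangle = 0$, $\|u_t^{\perp}\| = 1$, and the projection of $u_t^{\perp}$ on $\mathcal{M}$ is $\sin(\theta_t) b_t^1$. 

We now extend the set $\{b_t^1\}$ to complete an orthonormal basis $B = \{b_t^1, \ldots, b_t^{k-1}\}$ of $\mathcal{M}$. We 
also observe that $\{b_t^2, \ldots, b_t^{k-1}, u_t, u_t^{\perp}\}$ is an orthonormal basis of the subspace spanned by any 
basis of $\mathcal{M}$, along with $v_t$, and can be extended to a basis of $\mathbb{R}^d$. 

For $j = 1, \ldots, k$, we define $\tau_t^j$ as follows: 

$$\tau_t^j = \langle \mu^j, u_t \rangle = \cos(\theta_t) \langle \mu^j, b_t^1 \rangle$$

Finally we (re)-define the quantity $\xi_t$ as 

$$\xi_t = \sum_j \rho^j \sigma^j \frac{e^{-(\tau_t^j)^2 / 2(\sigma^j)^2}}{\sqrt{2\pi}}$$

and, for any $l = 1, \ldots, k-1$, we define: 

$$m_t^l = \sum_j \rho^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) \langle \mu^j, b_t^l \rangle$$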
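
The geometric claims in this notation are quick to confirm numerically. The Python sketch below builds $u_t = \cos(\theta_t) b_t^1 + \sin(\theta_t) v_t$ and $u_t^{\perp} = \sin(\theta_t) b_t^1 - \cos(\theta_t) v_t$ for a concrete choice of vectors in $\mathbb{R}^4$ (an assumption made for illustration, with $\mathcal{M}$ spanned by the first two coordinate axes) and exposes the inner products the text refers to. The function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def build_directions(theta, b1, v):
    """Return (u, u_perp) for unit vectors b1 (in M) and v (orthogonal to M)."""
    u = [math.cos(theta) * b + math.sin(theta) * c for b, c in zip(b1, v)]
    u_perp = [math.sin(theta) * b - math.cos(theta) * c for b, c in zip(b1, v)]
    return u, u_perp

# Illustrative choice: M = span(b1, b2) in R^4, v orthogonal to M.
theta = 0.9
b1 = [1.0, 0.0, 0.0, 0.0]
b2 = [0.0, 1.0, 0.0, 0.0]
v = [0.0, 0.0, 1.0, 0.0]

u, u_perp = build_directions(theta, b1, v)
```

Here `u`, `u_perp` and `b2` are mutually orthogonal unit vectors, and the component of `u_perp` inside $\mathcal{M}$ is `sin(theta)` along `b1`.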

6.4.3 Proof of Lemma 20 

The main idea behind the proof of Lemma 20 is to estimate the norm and the projection of $u_{t+1}$; 
we do this in three steps. First, we estimate the projection of $u_{t+1}$ along $u_t$; next, we estimate 
this projection on $u_t^{\perp}$, and finally, we estimate its projection along $b_t^2, \ldots, b_t^{k-1}$. Combining these 
projections, and observing that the projection of $u_{t+1}$ on any direction perpendicular to these is 0, 
we can prove the lemma. 
As before, we define 

$$Z_{t+1} = \Pr[x \in C_{t+1}]$$

Now we make the following claim. 
Lemma 27. For any $t$ and any $j$, 

$$\Pr[x \sim D_j \mid x \in C_{t+1}] = \frac{\rho^j}{Z_{t+1}}\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)$$

Proof. Same as the proof of Lemma 21. 

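Next, we estimate the projection of $u_{t+1}$ along $u_t$. 
Lemma 28. 

$$\langle u_{t+1}, u_t \rangle = \frac{\xi_t + \cos(\theta_t)\, m_t^1}{Z_{t+1}}$$

Proof. Consider a sample $x$ drawn from distribution $D_j$. The projection of $x$ on $u_t$ is distributed 
as a Gaussian with mean $\tau_t^j$ and standard deviation $\sigma^j$. The probability that $x$ lies in $C_{t+1}$ is 
$\Pr[N(\tau_t^j, \sigma^j) > 0] = \Phi(-\frac{\tau_t^j}{\sigma^j}, \infty)$. Given that $x$ lies in $C_{t+1}$, the projection of $x$ on $u_t$ is distributed as 
a truncated Gaussian, with mean $\tau_t^j$ and standard deviation $\sigma^j$, which is truncated at 0. Therefore 

$$E[\langle x, u_t \rangle \mid x \in C_{t+1}, x \sim D_j] = \frac{1}{\Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)} \int_{y=0}^{\infty} \frac{y\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy$$

which is again equal to 

$$\frac{1}{\Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)} \left( \tau_t^j \int_{y=0}^{\infty} \frac{e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy + \int_{y=0}^{\infty} \frac{(y - \tau_t^j)\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy \right)$$

$$= \frac{1}{\Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)} \left( \tau_t^j\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) + \int_{y=0}^{\infty} \frac{(y - \tau_t^j)\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}}{\sigma^j \sqrt{2\pi}}\, dy \right)$$

We can evaluate the integral in the equation above as follows. 

$$\int_{y=0}^{\infty} (y - \tau_t^j)\, e^{-(y - \tau_t^j)^2 / 2(\sigma^j)^2}\, dy = (\sigma^j)^2 \int_{z = (\tau_t^j)^2 / 2(\sigma^j)^2}^{\infty} e^{-z}\, dz = (\sigma^j)^2 e^{-(\tau_t^j)^2 / 2(\sigma^j)^2}$$

Therefore we can conclude that 

$$E[\langle x, u_t \rangle \mid x \in C_{t+1}, x \sim D_j] = \tau_t^j + \frac{1}{\Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right)} \cdot \sigma^j \frac{e^{-(\tau_t^j)^2 / 2(\sigma^j)^2}}{\sqrt{2\pi}}$$

Now we can write 

$$\langle u_{t+1}, u_t \rangle = \sum_j E[\langle x, u_t \rangle \mid x \sim D_j, x \in C_{t+1}]\, \Pr[x \sim D_j \mid x \in C_{t+1}]$$

$$= \sum_j \frac{\rho^j}{Z_{t+1}}\, \Phi\!\left(-\frac{\tau_t^j}{\sigma^j}, \infty\right) E[\langle x, u_t \rangle \mid x \sim D_j, x \in C_{t+1}]$$

where we used Lemma 27. The lemma follows by recalling that $\tau_t^j = \cos(\theta_t) \langle \mu^j, b_t^1 \rangle$. 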
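Lemma 29. For any $t$, 

$$\langle u_{t+1}, u_t^{\perp} \rangle = \frac{\sin(\theta_t)\, m_t^1}{Z_{t+1}}$$

Proof. Let $x$ be a sample drawn from distribution $D_j$. Since $u_t^{\perp}$ is perpendicular to $u_t$, and $D_j$ is 
a spherical Gaussian, given that $x \in C_{t+1}$, that is, the projection of $x$ on $u_t$ is greater than 0, the 
projection of $x$ on $u_t^{\perp}$ is still distributed as a Gaussian with mean $\langle \mu^j, u_t^{\perp} \rangle$ and standard deviation 
$\sigma^j$. That is, 

$$E[\langle x, u_t^{\perp} \rangle \mid x \sim D_j, x \in C_{t+1}] = \langle \mu^j, u_t^{\perp} \rangle$$

Also recall that, by definition of $u_t^{\perp}$, $\langle \mu^j, u_t^{\perp} \rangle = \sin(\theta_t) \langle \mu^j, b_t^1 \rangle$. To prove the lemma, we observe 
that $\langle u_{t+1}, u_t^{\perp} \rangle$ is equal to 

$$\sum_j E[\langle x, u_t^{\perp} \rangle \mid x \sim D_j, x \in C_{t+1}]\, \Pr[x \sim D_j \mid x \in C_{t+1}]$$

The lemma follows by using Lemma 27. 

Lemma 30. For $l \geq 2$, 

$$\langle u_{t+1}, b_t^l \rangle = \frac{m_t^l}{Z_{t+1}}$$

Proof. Let $x$ be a sample drawn from distribution $D_j$. Since $b_t^l$ is perpendicular to $u_t$, and $D_j$ is 
a spherical Gaussian, given that $x \in C_{t+1}$, that is, the projection of $x$ on $u_t$ is greater than 0, the 
projection of $x$ on $b_t^l$ is still distributed as a Gaussian with mean $\langle \mu^j, b_t^l \rangle$ and standard deviation 
$\sigma^j$. That is, 

$$E[\langle x, b_t^l \rangle \mid x \sim D_j, x \in C_{t+1}] = \langle \mu^j, b_t^l \rangle$$

To prove the lemma, we observe that $\langle b_t^l, u_{t+1} \rangle$ is equal to 

$$\sum_j E[\langle x, b_t^l \rangle \mid x \sim D_j, x \in C_{t+1}]\, \Pr[x \sim D_j \mid x \in C_{t+1}]$$

The lemma follows by using Lemma 27. 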

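Finally, we show a lemma which estimates the norm of the vector $u_{t+1}$. 
Lemma 31. 

$$\|u_{t+1}\|^2 = \frac{1}{Z_{t+1}^2} \left( \xi_t^2 + 2 \xi_t \cos(\theta_t)\, m_t^1 + \sum_{l=1}^{k-1} (m_t^l)^2 \right)$$

Proof. Combining Lemmas 28, 29 and 30, we can write: 

$$\|u_{t+1}\|^2 = \langle u_t, u_{t+1} \rangle^2 + \langle u_t^{\perp}, u_{t+1} \rangle^2 + \sum_{l \geq 2} \langle b_t^l, u_{t+1} \rangle^2$$

$$= \frac{1}{Z_{t+1}^2} \left( \xi_t^2 + 2 \xi_t \cos(\theta_t)\, m_t^1 + \cos^2(\theta_t) (m_t^1)^2 + \sin^2(\theta_t) (m_t^1)^2 + \sum_{l=2}^{k-1} (m_t^l)^2 \right)$$

The lemma follows by plugging in the fact that $\cos^2(\theta_t) + \sin^2(\theta_t) = 1$. 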

Now we are ready to prove Lemma 20. 
Proof. (Of Lemma 20) Since $b_t^1, \ldots, b_t^{k-1}$ form a basis of $\mathcal{M}$, we can write: 

$$\cos^2(\theta_{t+1}) = \frac{\sum_l \langle u_{t+1}, b_t^l \rangle^2}{\|u_{t+1}\|^2} \qquad (2)$$

$\|u_{t+1}\|^2$ is estimated in Lemma 31, and $\langle u_{t+1}, b_t^l \rangle$ for $l \geq 2$ is estimated by Lemma 30. Using Lemmas 28 and 29, 
as $b_t^1$ lies in the subspace spanned by the orthogonal vectors $u_t$ and $u_t^{\perp}$, we can write: 

$$\langle u_{t+1}, b_t^1 \rangle = \langle u_t, u_{t+1} \rangle \langle u_t, b_t^1 \rangle + \langle u_t^{\perp}, u_{t+1} \rangle \langle u_t^{\perp}, b_t^1 \rangle = \frac{\cos(\theta_t)\, \xi_t + m_t^1}{Z_{t+1}}$$

Plugging this in to Equation (2), we get: 

$$\cos^2(\theta_{t+1}) = \frac{\xi_t^2 \cos^2(\theta_t) + 2 \xi_t \cos(\theta_t)\, m_t^1 + \sum_l (m_t^l)^2}{\xi_t^2 + 2 \xi_t \cos(\theta_t)\, m_t^1 + \sum_l (m_t^l)^2}$$

The lemma follows by rearranging the above equation, similar to the proof of Lemma 1. 